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Abstract 

Clustering is widely studied in statistics and machine learning, with 
applications in a variety of fields. As opposed to classical algorithms which 
return a single clustering solution, Bayesian nonparametric models pro¬ 
vide a posterior over the entire space of partitions, allowing one to assess 
statistical properties, such as uncertainty on the number of clusters. How¬ 
ever, an important problem is how to summarize the posterior; the huge 
dimension of partition space and difficulties in visualizing it add to this 
problem. In a Bayesian analysis, the posterior of a real-valued param¬ 
eter of interest is often summarized by reporting a point estimate such 
as the posterior mean along with 95% credible intervals to characterize 
uncertainty. In this paper, we extend these ideas to develop appropriate 
point estimates and credible sets to summarize the posterior of clustering 
structure based on decision and information theoretic techniques. 

Keywords: Mixture model; Random partition; Variation of information; Binder’s 
loss. 


1 Introduction 


Clustering is widely studied in statistics and machine learning, with applications 
in a variety of fields. Numerous models and algorithms for clustering exist, and 
new studies which apply these methods to cluster new datasets or develop novel 
models or algorithms are constantly being produced. Classical algorithms such 


as agglomerative hierarchical clustering or the k-means algorithm (Hartigan 
and Wong 1979]) are popular, but only explore a nested subset of partitions 
or require specifying the number of clusters apriori. Moreover, it is difficult to 
assess statistical properties, such as uncertainty on the number of clusters. 


Bayesian nonparametric clustering or random partition models (Quintana 


2006 ) are becoming increasingly popular, as they overcome many of the draw¬ 


backs of classical algorithms. A Bayesian nonparametric treatment of the clus¬ 
tering problem involves assigning a prior over the space of all possible partitions 
of the data and computing the posterior of the partition given the data. Thus, 
instead of returning a single clustering solution, Bayesian nonparametric models 
provide a posterior over the entire space of clusterings, expressing our belief and 
uncertainty in the clustering structure given the data. 

However, an important problem in Bayesian cluster analysis is how to sum¬ 
marize this posterior; indeed, often the first question one asks is what is an 
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appropriate point estimate of the clustering structure based on the posterior. 
Such a point estimate is useful for concisely representing the posterior and often 
needed in applications. Moreover, a characterization of the uncertainty around 
this point estimate would desirable in many applications. Even in studies of 
Bayesian nonparametric models where the latent partition is used simply as a 
tool to construct flexible models, such as in mixture models for density esti¬ 
mation (Lo 1984 ), it is important to understand the behavior of the latent 
partition to improve understanding of the model. To do so, the researcher 
needs to be equipped with appropriate summary tools for the posterior of the 
partition. 

Inference in Bayesian nonparametric partition models usually relies on Markov 
chain Monte Carlo (MCMC) techniques, which produce a large number of par¬ 
titions that represent approximate samples from the posterior. Due to the huge 
dimension of the partition space and the fact that many of these partitions are 
quite similar differing only in a few data points, the posterior is typically spread 
out across a large number of partitions. Clearly, describing all the unique parti¬ 
tions sampled would be infeasible, further emphasizing the need for appropriate 
summary tools to communicate our findings. 

In a typical Bayesian analysis, the posterior of a univariate parameter of 
interest is often summarized by reporting a point estimate such as the posterior 
mean, median, or mode, along with the 95% credible interval to characterize 
uncertainty. In this paper, we aim to extend these ideas to develop summary 
tools for the posterior on partitions. In particular, we seek to answer the two 
questions: 1) What is an appropriate point estimate of the partition based on 
the posterior? 2) Can we construct a 95% credible region around this point 
estimate to characterize our uncertainty? 

We first focus on the problem of finding an appropriate point estimate. A 
simple solution is to use the posterior mode. If the likelihood of the data given 
the partition and the prior of the partition are available in closed form, the 
posterior mode can be estimated based on the MCMC output by the sampled 
partition which maximizes the non-normalized posterior. In practice, a closed 
form for the likelihood or prior is often unavailable, for example due to the pres¬ 
ence of hyperparameters that cannot be marginalized. In general, the posterior 
mode can be found by reporting the partition visited most frequently in the 
sampler. Yet this approach can be problematic, as producing reliable frequency 
counts is intractable due to the huge dimension of the partition space. In fact, 
in many examples, the MCMC chain does not visit a partition more than once. 
To overcome this, alternative search techniques have been developed to locate 


the posterior mode (Heller and Ghahramani 2005 , Heard et al. 2006 , Dahl 


2009 ). However, it is well known that the mode can be unrepresentative of the 


center of a distribution. 

Alternative methods have been proposed based on the posterior similarity 
matrix. For a sample size of N, the elements of this N by N matrix represent 
the probability that two data points are in the same cluster, which can be 
estimated by the proportion of MCMC samples that cluster the two data points 
together. Then, classical hierarchical or partitioning algorithms are applied 
based on the similarity matrix (Medvedovic and Sivaganesan| 2002 , Medvedovic 


et al. 2004 , Rasmussen et al. 2009 , Molitor et al. 2010]). These methods have 


the disadvantage of being ad-hoc. 

A more elegant solution is based on decision theory. In this case, one de- 


2 







































fines a loss function over clusterings. The optimal point estimate is that which 
minimizes the posterior expectation of the loss function. For example, for a 
real-valued parameter 9, the optimal point estimate is the posterior mean un¬ 
der the squared error loss L 2 (9,9) = (9 — 9) 2 ; the posterior median under the 
absolute error loss Li(9,9) = \Q — 9 1; and the posterior mode under the 0-1 loss 

Lo-i{e,9) = l{9^9). 


The question to answer then becomes what is an appropriate loss function 
on the space of clusterings. The 0-1 loss function, a simple choice which leads 
to the posterior mode as the point estimate, is not ideal as it does not take into 
account the similarity between two clusterings. More general loss functions were 
developed by Binder 1978 , and the so-called Binder’s loss, which measures the 


disagreements in all possible pairs of observations between the true and esti¬ 
mated clusterings, was studied in a Bayesian nonparametric setting by[Lau and 
Alternative loss functions considered in Bayesian nonparametrics 


Green 2007 


can be found in Quintana and Iglesias 2003 and Fritsch and Ickstadt 2009 


In this paper, we propose to use the variation of information developed by 
Meila 2007 as a loss function in a Bayesian nonparametric setting. Both the 
variation of information and Binder’s loss possess the desirable properties of 
being metrics on the space of partitions and being aligned with the lattice of 
partitions. We provide a detailed comparison of these two metrics and discuss 
the advantages of the variation of information over Binder’s loss as a loss func¬ 
tion in Bayesian cluster analysis. Additionally, we propose a novel algorithm to 
locate the optimal partition, taking advantage of the fact that both metrics are 
aligned on the space of partitions. 

Next, to address the problem of characterizing uncertainty around the point 
estimate, we propose to construct a credible ball around the point estimate. As 
both Binder’s loss and the variation of information are metrics on the partition 
space, we can easily construct such a ball. Interestingly, the two metrics can 
produce very different credible balls, and we discuss this in detail. In existing 
literature, quantifications of uncertainty include reporting a heat map of the 
estimated posterior similarity matrix. However, there is no precise quantifica¬ 
tion of how much uncertainty is represented by the posterior similarity matrix, 
and in a comparison with the 95% credible balls, we find that the uncertainty 
is under-represented by the posterior similarity matrix. Finally, we provide an 
algorithm to construct the credible ball and discuss ways to depict or report it. 

The paper is organized as follows. Section [2] provides a review of Bayesian 
nonparametric clustering and existing point estimates of the clustering structure 
from a decision theoretic approach. In Section [3j we give a detailed comparison 
of two loss functions, Binder’s loss and the variation of information, pointing 
out advantages of the latter. The optimal point estimate under the variation of 
information is derived in Section [4] and a novel algorithm to locate the optimal 
partition is proposed. In Section [5] we construct credible balls around the point 
estimate to characterize posterior uncertainty and discuss how to compute and 
depict it. Finally, simulated and real examples are provided in Section [6] 


2 Review 

This section provides a review of Bayesian nonparametric clustering models and 
existing point estimates of the clustering in literature. 
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2.1 Bayesian nonparametric clustering 

Mixture models are one of the most popular modelling tools in Bayesian non- 
parametrics. The data are assumed conditionally i.i.d. with density 

f{y\P) = J K{y\0)dP{6), 

where K(y\9 ) is a parametric density on the sample space with parameter 9. A 
nonparametric prior is placed on the mixing measure P, and typically this prior 
has discrete realizations almost surely (a.s.). In this case, 

OO 

P = ^WjSe- a.s., 

3 = 1 

where it is often assumed that the weights (wj) and atoms ( Oj) are independent 
and the 9j are i.i.d. from some base measure Pq. Thus, the density is modelled 
with a countably infinite mixture model 

OO 

f(y\P)=Y,^ K m)- 

3 = 1 

Since P is discrete a.s., this model induces a latent partitioning c of the 
data where two data points belong to the same cluster if they are generated 
from the same mixture component. The partition can be represented by c = 
(Ci,..., Ck N ), where Cj contains the indices of data points in the j th cluster and 
fcjv is the number of clusters in the sample of size N. Alternatively, the partition 
can be represented by c = (ci,.. ., cjv), where c n = j if the n th data point is 
in the j th cluster. An advantage of the Bayesian nonparametric approach is 
that the number of clusters Un is determined by and can grow with the data. 
Marginalizing over the random probability measure, the data y\-.N is modelled 
as 

/(yi:Jv|c) = n m (yj) = II / n K(y n \0)d,Po(9), 

3 =1 J =1 neCj 

where y j = {y^n^Cy 

The posterior of the partition, which reflects our beliefs and uncertainty in 
the clustering given the data, is simply proportional to the likelihood times the 
prior 


p(c\vi:n) oc 


(1) 


where the prior of the partition is obtained from the selected prior on the mixing 
measure. For example, a Dirichlet process prior (Ferguson 1973 ) for P with 
mass parameter a corresponds to 


P ( c ) 


r(q) 

r (a + N) 


a 




3=1 
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where rij is the number of data points in cluster j. Various other priors de¬ 
veloped in Bayesian nonparametric literature can be considered for the mixing 


measure P, such as the two-parameter Poisson-Dirichlet process (Pitman and 
Yor 1997|) or the normalized generalized Gamma process or more generally, 


the class of Poisson-Kingman models (Pitman 2003|), the class of normalized 


completely random measures, or the class of stick-breaking priors (Ishwaran and 


James 2001 ). See Lijoi and Priinster 2011 for an overview. 


In general, the likelihood or the prior may not be available in closed form. 
Moreover, there are 




3=0 


(k - j) 


N 


a Stirling number of the second kind, ways to partition the N data points in to 
k groups and 


N 


Bn = ^ Sjv,fc> 


fc=i 


a Bell number, possible partitions of the N data points. Even for small N, this 
number is very large, which makes computation of the posterior intractable for 
the simplest choice of prior and likelihood. Thus, MCMC techniques are typ¬ 
ically employed, such as the marginal samplers described by Neal 2000] with 


extensions in 


Favaro and Teh 2013 or the conditional samplers described in 


Ishwaran and James 

2001 , 

Kalli et al. 

2011 , or 

Papaspiliopoulos and Roberts 


terior 0 - Clearly, describing all the posterior samples is infeasible, and our aim 
is to develop appropriate summary tools to characterize the posterior. 

Extensions of Bayesian nonparametric mixture models are numerous and al¬ 
low one to model increasingly complex data. These include extensions for par¬ 
tially exchangeable data (Teh et al. 2006]), inclusion of covariates (MacEachern 


2000]), time dependent data (Griffin and Steel 2006|), and spatially dependent 


data (|Duan et al~ 2007 ) to name a few. See Muller and Quintana [2004] and 


Dunson [2010 for an overview. These extensions also induce latent clustering(s) 
of the observations, and the summary tools developed here are applicable for 
these settings as well. 


2.2 Point estimation for clustering 

Firstly, we seek a point estimate of the clustering that is representative of the 
posterior, which may be of direct interest to the researcher or, more generally, 
important for understanding the behavior of the posterior. From decision theory, 
a point estimate is obtained by specifying a loss function L( c, c), which measures 
the loss of estimating the true clustering c with c. Since the true clustering is 
unknown, the loss is averaged across all possible true clusterings, where the 
loss associated to each potential true clustering is weighted by its posterior 
probability. The point estimate c* corresponds to the estimate which minimizes 
the posterior expected loss, 

c* = argmin E[L(c, c) |s/i : jv] = argminj^ L(c, c)p(c\y 1:N ). 
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A simple choice for the loss function is the 0-1 loss, Lo-i(c,c) = l(c 7 ^ c), 
which assumes a loss of 0 if the estimate is equal to the truth and a loss of 1 
otherwise. Under the 0-1 loss, the optimal point estimate is the posterior mode: 


= argmaxp(c|yi : jv). 


However, this loss function is unsatisfactory because it doesn’t take into account 
similarity between two clusterings; a partition which differs from the truth in 
the allocation of only one observation is penalized the same as a partition which 
differs from the truth in the allocation of many observations. Moreover, it is well 
known that the mode can be unrepresentative of the center of a distribution. 
Thus, more general loss functions are needed. 

However, constructing a more general loss is not straightforward because, 
as point out by Binder 1978 , the loss function should satisfy basic principles 


such as invariance to permutations of the data point indices and invariance to 
permutations of the cluster labels for both the true and estimated clusterings. 


Binder notes that this first condition implies that the loss is a function of the 
counts riij, which count the number of data points in cluster i under c and 
cluster j under c for i = 1 ,..., fcjv and j = 1 ,..., fcjv; the notation fcjv and 
k n represents the number of clusters in c and c, respectively. He explores 
loss functions satisfying these principles, starting with simple functions of the 
counts riij. The so-called Binder’s loss is a quadratic function of the counts, 
which for all possible pairs of observations, penalizes the two errors of allocating 
two observations to different clusters when they should be in the same cluster 
or allocating them to the same cluster when they should be in different clusters: 


B(c, c) 




^ll(Cn — C n ')l(c n 7^ C n f) + ^2l(Cn 7^ C n ')l(c n — C n '). 


If the two types of errors are penalized equally, h = h = 1, then 


1 


fcjv 


kN k N 


B ( c ’c) = 2 HC n i+ + J 2 n +i n h 

V=1 j =1 * =1 i=i 

where rii + = n ij and n +j = J2i n ij- Under Binder’s loss with li = l 2 , the 
optimal partition c* is the partition c which minimizes 


'y' 11 (c n — c n ') p n n i 1, 


or equivalently, the partition c which minimizes 

y ' (1 (Cn = C n ') Pnn' ) 


( 2 ) 


n<n' 


where p nn > = P(c n = c n '\y\-N) is the posterior probability that two observations 
n and n! are clustered together. This loss function was first studied in Bayesian 
nonparametrics by Lau and Green 2007| . We note that in earlier work Dahl 
2006 considered minimization of Q but without the connection to Binder’s 


loss and the decision theoretic approach. 
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Binder’s loss counts the total number of disagreements, D , in the ( ,J) possi¬ 
ble pairs of observations. The Rand index (Rand 1971 ), a cluster comparison 
criterion, is defined as the number of agreements, A 1 in all possible pairs divided 
by the total number of possible pairs. Since D + A = (^), Binder’s loss and the 
Rand index, denoted R(c,c), are related: 


N 


B(c,c) = (1 - R(c, c)) 


and the point estimate obtained from minimizing the posterior expected Binder’s 
loss is equivalent to the point estimate obtained from maximizing the posterior 
expected Rand’s index. Motivated by this connection, Fri tsch and Ickstadt 
2009 consider maximizing the adjusted Rand index, introduced by Hubert and 


Arabie 1985 to correct the Rand index for chance. An alternative loss function 


is explored by Quintana and Iglesias 2003 specifically for the problem of outlier 
detection. 


A comparison of the variation of information 
and Binder’s loss 


Meila 2007 introduces the variation of information (VI) for cluster comparison, 


which is constructed from information theory and compares the information in 
two clusterings with the information shared between the two clusterings. More 
formally, the VI is defined as 


VI(c, c) = H(c) + H(c) - 2I(c, c) 

k n kj\ 




i—1 


1=1 


i= 1 1 = 1 


UjjN 
Tli j 


where log denotes log base 2. The first two terms represent the entropy of the 
two clusterings, which measures the uncertainty in bits of the cluster allocation 
of a unknown randomly chosen data point given a particular clustering of the 
data points. The last term is the mutual information between the two clusterings 
and measures the reduction in the uncertainty of the cluster allocation of a data 
point in c when we are told its cluster allocation in c. The VI ranges from 0 
to log(lV). A review of extensions of the VI to normalize or correct for chance 

2010| . However, some desirable properties of the 


are discussed in Vinh et al. 


VI are lost under these extensions. 

In this paper, we propose to use the VI as a loss function. Note that since 


I(c,c) = H(c) + H(c) - H(c, c), 


we can write 


VI(c, c) = H(c) + H(c) - 2H(c) - 2H(c) + 2H(c,c), 
= —H(c) — H(c) + 2H(c, c), 

2=1 j = 1 


&1V 

2EZ 


t=11=1 
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Figure 1: Hasse diagram for the lattice of partitions with a sample of size TV = 4. 
A line is drawn from c up to c when c is covered by c. 


We provide a detailed comparison with an TV-invariant version of Binder’s loss, 
defined as 


B ( c ,c) = = 




i= 1 


3 =1 


m - 


kN kN 


EE(f) ■ 


*=i j'=i 


Both loss functions are considered TV-invariant as they only depend on TV through 
the proportions n t j /TV. We focus on these two loss functions as they satisfy sev¬ 
eral desirable properties. 

The first important property is that both VI and B are metrics on the space 
of partitions. 

Property 3.1 Both VI and B are metrics on the space of partitions, that is 
they satisfy: 

1. Non-negativity: d(c,c) > 0 and d( c,c) = 0 if and only if c = c under a 
permutation of cluster labels, 

2. Symmetry: d( c,c) =d(c,c), 

3. Triangle inequality: for any c,c,c, 

d(c,c) < d( c,c) + d( c,c). 


A proof for VI can be found in 


Meila 


2007 


For B, the proof results from 


the fact that B can be derived as the Hamming distance between the binary 
representation of the clusterings. 

The next properties involve first viewing the space of partitions as a partially 
ordered set. In particular, consider the space of partitions C and the binary 
relation < on C defined by set containment, i.e. for c,c € C, c < c if for 
all* = 1,... ,fcjv, Ci C Cj for some j G {1,... ,fcjv}. The partition space C 
equipped with < is a partially ordered set, meaning that the following properties 
are satisfied: 


1. Reflexivity: c < c, 

2. Antisymmetry: if c < c and c < c then c = c under a permutation of 
cluster labels, 
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3. Transitivity: if c < c and c < c, then c < c. 

For any c,c £ C, c is covered by c, denoted c -< c, if c < c and there is no 
c £ C such that c < c < c. This covering relation is used to define the Hasse 
diagram , where the elements of C are represented as nodes of a graph and a 
line is drawn from c up to c when c -< c. An example of the Hasse diagram for 
N = 4 is depicted in Figure [T] 

The space of partitions possesses an even richer structure; it forms a lattice. 
This follows from the fact that every pair of partitions has a greatest lower 
bound (g.l.b.) and least upper bound (l.u.b.), where an element c £ C is an 
upper bound for a subset S C C if s < c for all s £ S, and c £ C is the 
least upper bound, if it exists, for a subset S C C if c is an upper bound for S 
and c < c' for all upper bounds c' of S (a lower bound and the greatest lower 
bound are similarly defined). In general, a partially ordered set satisfying these 
properties forms a lattice; that is, for any c,c, c £ C 


1. c A c = c and c V c = c, 

2. c A c = c A c and c V c = c V c, 

3. c A (c A c) = (c A c) A c and c V (c V c) = (c V c) V c, 

4. c A (c V c) = c and c V (c A c) = c, 


where the operators A and V are defined as c A c = g.l.b. (c, c) and cVc = 
l.u.b.(c,c) and equality holds under a permutation of cluster labels. In this 
case, the operator A is called the meet and the operator V is called the join. 
Following the conventions of lattice theory, we will use 1 to denote the greatest 
element of the lattice of partitions, i.e. the partition with every observation in 
one cluster c = ({1,... ,7V}), and 0 to denote the least element of the lattice 
of partitions, i.e. the partition with every observation in their own cluster 
c = ({1},..., {IV}). See Nation 1991 for more details on lattice theory. 

A desirable property is that both VI and B are aligned with the lattice of 
partitions. Specifically, both metrics are vertically aligned in the Hasse diagram; 
if c is connected up to c and c is connected up to c, then the distance between 
c and c is the vertical sum of the distances between c and c and between c and 
c (see Property 3.2). And, both metrics are horizontally aligned ; the distance 
between any two partitions is the horizontal sum of the distances between each 
partition and the meet of the two partitions (see Property 3.3). 


Property 3.2 For both VI and B, if c > c > c, then 


d( c,c) = d( c,c) + d( c,c). 
Property 3.3 For both VI and B, 


d{ c, c) = d{ c, c A c) + d{ c, c A c). 


Proofs can be found in the Appendix. These two properties imply that if 
the Hasse diagram is stretched to reflect the distance between any partition 
and 1, the distance between any two partitions can be easily determined from 
the stretched Hasse diagram. Figures [2] and [3] depict the Hasse diagram for 
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Figure 2: Hasse diagram stretched by VI with a sample of size N = 4. Note 
2 — | log(3) ps 0.811. From the VI streched Hasse diagram, we can determine 
the distance between any two partitions. Example: if c = ({1, 2}, {3,4}) and 
c = ({1}, {3}, {2,4}), then cAc = ({1}, {2}, {3}, {4}) and d( c, c) = d( cAc, 1) — 
d( c, 1) + d {c A c, 1) - d( c, 1) = 2 - 1 + 2 - 1.5 = 1.5. 



Figure 3: Hasse diagram stretched by B with a sample of size N = 4. From 
the B streched Hasse diagram, we can determine the distance between any 
two partitions. Example: if c = ({1, 2}, {3,4}) and c = ({1}, {3}, {2,4}), then 
cAc = ({1}, {2}, {3}, {4}) and d(c,c) = d( cAc, l)—d(c, l)+d(cAc, 1) — d(c, 1) = 
0.75 - 0.5 + 0.75 - 0.625 = 0.375. 


N = 4 in Figure [l] stretched according VI and B respectively. As an example, 
consider c = ({1, 2}, {3,4}) and c = ({1}, {3}, {2,4}); their meet is c Ac = 
({1}, {2}, {3}, {4}), and the VI distance is VI(c, c) = VI(c A c, 1) — VI(c, 1) + 
VI(c A c, 1) - VI(c, 1) = 2 - 1 + 2 - 1.5 = 1.5. 

From the stretched Hasse diagram, we gain several insights into the similar¬ 
ities and differences between the two metrics. An evident difference is the scale 
of the two diagrams. 

Property 3.4 A distance on partitions satisfying Properties \3.2\ and \3.3\ has 
the property that for any two partitions c and c, 

d( c, c) < d( 1, 0). 


Thus, 


VI( c,c) < log(IV) 


and B( c,c) < 1 


1 

N' 
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In both cases, the bound on the distance between two clusterings depends 
on the sample size N. However, the behavior of this bound is very different; for 
VI, it approaches infinity as N —► oo, and for B, it approaches one as N —> oo. 
As N grows, the number of total partitions Bn increases drastically. Thus, it 
is sensible that the bound on the metric grows as the size of the space grows. 
In particular, 1 and 0 become more distant are N —► oo, as there are increasing 
number, Bn — 2, of partitions between these two extremes; for B, the loss of 
estimating one of these extremes with the other approaches the fixed number 
one, while for VI, the loss approaches infinity. 

From the stretched Hasse diagram in Figures [2] and |3j we can determine 
the closest partitions to any c. For example, the closest partitions to 1 are 
the partitions which split 1 into two clusters, one singleton and one containing 
all other observations; and the closest partitions to ({1}, {2}, {3,4}) are the 
partition which merges the two smallest clusters ({1,2}, {3,4}) and the partition 
which splits the cluster of size 2 ({1}, {2}, {3}, {4}). 

Property 3.5 For both metrics VI and B, the closest partitions to a partition 
c are: 

• if c contains at least two clusters of size one and at least one cluster of 
size two, the partitions which merge any two clusters of size one and the 
partitions which split any cluster of size two. 

• if c contains at least two clusters of size one and no clusters of size two, 
the partitions which merge any two clusters of size one. 

• if c contains at most one cluster of size one, the partitions which split the 
smallest cluster of size greater than one into a singleton and a cluster with 
the remaining observations of the original cluster. 

This property characterizes the set of estimated partitions which are given 
the smallest loss. Under both loss functions, the smallest loss of zero occurs 
when the estimated partition is equal to the truth. Otherwise, the smallest 
loss occurs when the estimated clustering differs from the truth by merging two 
singleton clusters or splitting a cluster of size two, or, if neither is possible, 
splitting the smallest cluster of size n into a singleton and a cluster of size n — 1. 
We further note that the loss of estimating the true clustering with a clustering 
which merges two singletons or splits a cluster of size two, is ^ and for VI 
and B respectively, which converges to 0 as N —► oo for both metrics, but at a 
faster rate for B. 

Next, we note that the Hasse diagram stretched by B in Figure [ 3 ] appears 
asymmetric, in the sense that 1 is more separated from the others when com¬ 
pared to the Hasse diagram stretched by VI in Figure [2j 

Property 3.6 Suppose N is divisible by k, and let c *. denote a partition with 
k clusters of equal size N/k. 

B(l,Cfc)=l-i>i--i:=B(0,Cfc). 

VI{ 1, c fc ) = log(fc) < log(IV) - log (k) = V7(0, Cfc), for k < VN, 

and 

VI(1, c fc ) = log(fc) > log(IV) - log(Ie) = V7(0, c fe ), for k > VN. 
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Figure 4: Example of the VI ball around c = ({1, 2}, {3,4}), with the rainbow 
color indicating increasing distance from c. The smallest non-trivial credible 
ball contains all the red clusterings, the next smallest contains the red and 
orange clusterings, and so on. 
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Figure 5: Example of the B ball around c = ({1, 2}, {3,4}), with the rainbow 
color indicating increasing distance from c. The smallest non-trivial credible 
ball contains all the red clusterings, the next smallest contains the red and 
orange clusterings, and so on. 

Property |3.6| reflects the asymmetry apparent in Figure [3j In particular, for 
B, a partition with two clusters of equal size c 2 will always be closer to the 
extreme 0 of everyone in their own cluster than the extreme 1 of everyone in 
one cluster. However, as the sample size increases, c 2 becomes equally distant 
between the two extremes. For all other values of k, the extreme 0 will always 
be closer. This behavior is counter-intuitive for a loss function on clusterings. 
VI is much more sensible in this regard. If k = y/N, 0 and 1 are equally good 
estimates of c. For k < y/N , c*. is is better estimated by 1 and for k > y/N, C/~ 
is better estimated by 0; as the sample size increases, these preferences become 
stronger. In particular, note that loss of estimating c 2 with 1 will always be 
smaller than estimating it with 0 for N > 4. 

Additionally, we observe from Figure [3] that the partitions with two clusters 
of sizes 1 and 3 are equally distant between the two extremes under B. The 
following property generalizes this observation. 

Property 3.7 Suppose N is an even and square integer. Then, the partitions 
with two clusters of sizes n = |(1V — y/N) and N — n are equally distance from 
1 and 0 under B. 
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This property is unappealing for a loss function, as it states that the loss 
of estimating a partition consisting of two clusters of sizes ^(N — y/~N) and 
i(iV + \fN) with the partition of only one cluster or with the partition of all 
singletons is the same. Intuitively, however, 1 is a better estimate. The behavior 
of VI is much more reasonable, as partitions with two clusters will always be 
better estimated by 1 than 0 for N > 4 and partitions with y/~N clusters of 
equal size are equally distant from 0 and 1. 

Finally, we note that as both VI and B are metrics on the space of clusterings, 
we can construct a ball around c of size e, defined as: 

B e ( c) = {c € C : d( c,c) < e}. 

Interestingly, the balls differ between the two metrics even for the simple ex¬ 
ample with N = 4 in Figures [4] and [5j which consider a ball around c = 
({I, 2}, {3,4}). In these figures, partitions are rainbow colored by increasing 
distance to c; thus, the smallest non-trivial ball, i.e. the smallest ball around c 
with at least two partitions, contains all red clusterings, the next smallest ball 
contains all red and orange clusterings, and so on. In general, from Property 
|3.5[ the smallest non-trivial ball will be the same for the two metrics, which 
is confirmed in the figures, as the set of red clusterings coincide. When con¬ 
sidering the next smallest ball, differences emerge; the VI ball includes the red 
clusterings and the orange clusterings 0 and 1, and the B ball includes the red 
clusterings and the orange clustering 0. Note that 1 is only included in the B 
ball around c when it is expanded to include all clusterings and is considered as 
distant to c as the partitions c = ({1, 3}, {2,4}) and c = ({1,4}, {2, 3}). In the 
authors’ opinions, the VI ball more closely reflects our intuition of the closest 
set of partitions to c. 

4 Point estimation via the variation of informa¬ 
tion 

As detailed in the previous section, both VI and B share several desirable prop¬ 
erties including being aligned with the lattice of partitions and coinciding in the 
smallest non-trivial ball around any clustering. However, in our comparison, 
differences also emerged. Particularly, we find that B exhibits some peculiar 
asymmetries, preferring to split clusters over merging, and we find that the VI 
ball more closely reflects our intuition of the neighborhood of a partition. In 
light of this, we propose to use VI as loss function in Bayesian cluster analysis. 
Under the VI, the optimal partition c* is 

c* = argminE[VI(c, c)|T>] 

C 

N N N N 

= argmin ^ log( ^ l(c n / = c n )) - 2 ^ E[log( ^ l(c n > = c n ,c n > = c n ))\V\, 

c n—1 n '—1 n=1 n'—l 

(3) 

with T> denoting the data. For a given c, the second term in ([3]) can be approx¬ 
imated based on the MCMC output. This, however, may be computationally 
demanding if the number of MCMC samples is large and if must be eval¬ 
uated for a large number of c. Alternatively, one can use Jensen’s inequality, 
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swapping the log and expectation, to obtain a lower bound on the expected loss 
which is computationally efficient to evaluate: 


N N N N 

argmin^ log( ^ l(c„, = c n )) - 2 ^ log( ^ P(c n , =c„|£>)l(c n / = c n )). 


n—1 n' — l 


n =1 n' = l 


(4) 


Similar to minimization of the posterior expected Binder’s loss, minimization 
of Q only depends on the posterior through the posterior similarity matrix, 
which can pre-computed based on the MCMC output. 

Due to the huge dimensions of the partition space, computing 0 or 0 for 
every possible c is practically impossible. An simple technique to find the opti¬ 
mal partition c* restricts the search space to some smaller space of partitions, 
for example, the partitions visited in the MCMC chain. Alternative algorithms 
have been developed in Quintana and Iglcsias 2003 and Lau and Green 2007| . 

We propose a greedy search algorithm to locate the optimal partition c* 
based on the Hasse diagram, which can be used to for both VI and B. In 
particular, given some partition c, we consider the l closest partitions that 
cover c and the l closest partitions that c covers. Here, the distance used to 
determine the closest partitions corresponds to the selected loss of VI or B. 
Next, the posterior expected loss is computed for all proposed partitions and 
we move in the direction of minimum posterior expected loss. The algorithm 
stops when no reduction in the posterior expected loss is obtained or when a 
maximum number of iterations has been reached. In practice, we may initialize 
the algorithm with say a particular sample of the MCMC, e.g. the last sample, 
or with the MCMC sample which minimizes the posterior expected loss. The 
algorithm is summarized below. 

Greedy search algorithm based on the Hasse diagram: 


Initialize c. 


• For i = 1,... I 

— Find the l closest partitions that cover c and the l closest partitions 
that c covers. 

— Compute E[L(c,c)|23] for all 2 1 partitions and select the partition c' 
with minimal E[L(c, c')\V\. 

— If E[L(c, c')[D] < E[L(c, c)\V], set c = c'. Otherwise, STOP. 

• end 


5 Credible balls of partitions 

To characterize the uncertainty in the point estimate c* , we propose to construct 
a credible ball of a given credible level 1 — a, a £ [0,1], defined as 

= {c : d{ c*,c) < e*}, 

where e* is the smallest e > 0 such that 

P(B e (c*)\V) > 1 - a. 
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The credible ball is the smallest ball around c* with posterior probability at 
least 1 — a. It reflects the posterior uncertainty in the point estimate c*; with 
probability 1 — a, we believe that the clustering is within a distance of e* from 
the point estimate c* given the data. It can be defined based on any metric on 
the space of partitions, such as VI and B. If the smallest non-trivial ball under 
VI or B has posterior probability of at least 1 — a, the credible balls under the 


two metrics will coincide (see Property 3.5). Typically, however, they will be 


different, see the discussion at the end of Section [3j 

From the MCMC output, we can obtain an estimate of e*, and thus the 
credible ball of level 1 — a. First, the distance between all MCMC samples 
{c m } and c* is computed. For any e > 0, 


M 


P(B e (c*)\V) =E[l(d(c*,c) < e)\V] « ^ £ l(d(c*,c m ) < e), 

m= 1 

and e* is the smallest e > 0 such that 


i M 

mEW.c 

m =1 


“) < e) > 1 - a. 


To characterize the credible ball, we define the vertical and horizontal bounds 
based on the credible ball. The vertical upper bounds consist of the partitions 
in the credible ball with the smallest number of clusters which are most distant 
from c*. The vertical lower bounds consist of the partitions in the credible 
ball with the largest number of clusters which are most distant from c*. The 
horizontal bounds consist of the partitions in the credible ball which are most 
distant from c*. 

For example, suppose N = 4 and the optimal point estimate under both 
VI and B is ({1, 2}, {3,4}). Further suppose that the 95% credible ball with 
the VI and B metric consists of the red, orange, and yellow partitions depicted 
in Figures [4] and [5j respectively. For VI, the upper vertical bound is 1, the 
lower vertical bound is 0, and the horizontal bounds are the yellow partitions, 
i.e. those with one singleton and one cluster of size N — 1 = 3. For B, the 
upper vertical bounds are the partitions with one singleton and one cluster of 
size N — 1 = 3, the lower vertical bound is 0, and the horizontal bounds are the 
yellow partitions, i.e. those with one singleton and one cluster of size N — 1 = 3 
and those with two singletons and one cluster of size TV — 2 = 2 which are not 
covered by c*. 

In practice, we define the vertical and horizontal bounds based on the par¬ 
titions in the credible ball with positive estimated posterior probability. 

In existing literature, quantification of uncertainty in the clustering struc¬ 
ture is typically described through a the heat map of the estimated posterior 
similarity matrix. However, as opposed to the credible ball of Bayesian confi¬ 
dence level 1 — a, there is no precise quantification of how much uncertainty 
is represented by the posterior similarity matrix. Moreover, in the examples of 
Section [6j we find that in a comparison with the 95% credible balls, the un¬ 
certainty is under-represented by the posterior similarity matrix. Additionally, 
the credible balls have the added desirable interpretation of characterizing the 
uncertainty around the point estimate c*. 
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(a) Example 1: 4 clusters 


(b) Example 2: 4 clusters 


Figure 6: The data are simulated from a mixture of four normals with locations 
(±2, ±2)' and colored by cluster membership. In (a) the standard deviations 
of all components are equal to 1, and in (b) the standard deviations are 1 in 
the first and third quadrants, 0.5 in second quadrant, and 1.5 in the fourth 
quandrant. 


6 Examples 

We provide both simulated and real examples to compare the point estimates 
from VI and Binder’s loss and describe the credible ball representing uncertainty 
in the clustering estimate. 


6.1 Simulated examples 

Two datasets of size n = 200 are simulated from: 





(-1)L^J2 


3 

0 



For the first example, a , = 1 for all components, and for the second, crj = 1 
for the two components located in the first and third quadrants, aj = 0.5 in 
the second quadrant, and crj = 1.5 in the fourth quadrant. The data for both 
examples are depicted in Figure [6] and colored by cluster membership. 

We consider a Dirichlet process (DP) mixture model: 


Xi\P 


iid 



mi 

M2 


P ~ DP(aPo)) 


erf 0 

0 ct 2 


dP{n, £), 


(5) 


where ^ = (fj, i, ^ 2 )' and E is a diagonal matrix with diagonal elements (erf, cr|). 
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(a) Binder’s loss: 9 clusters (b) VI: 4 clusters (c) Modified VI: 4 clusters 

Figure 7: Example 1: optimal clustering estimate with color representing cluster 
membership for Binder’s loss, VI, and the modified VI. 


EBL(c)= 0.08804 Vl(c)= 1.03 Vl(c)= 1.033 
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(a) Binder’s loss: 12 clusters (b) VI: 5 clusters (c) Modified VI: 4 clusters 

Figure 8: Example 2: optimal clustering estimate with color representing cluster 
membership for Binder’s loss, VI, and the modified VI. 



E[BD] 

B(ct,c*) 

E [VI |2?] 

VI(c t , c*) 

c* from B 

0.062 

0.045 

0.816 

0.643 

c* from VI 

0.064 

0.044 

0.77 

0.569 

c* from mVI 

0.064 

0.049 

0.779 

0.620 


Table 1: Example 1: a comparison of the optimal partition with B, VI, or the 
modified VI in terms of expected B, B between the optimal and true clusterings, 
expected VI, and VI between the optimal and true clusterings. 



E[B D] 

B(c t ,c*) 

E[VI|D] 

VI(c t , c*) 

c* from B 

0.088 

0.056 

1.068 

0.764 

c* from VI 

0.099 

0.056 

1.03 

0.646 

c* from mVI 

0.099 

0.062 

1.033 

0.648 


Table 2: Example 2: a comparison of the optimal partition with B, VI, or the 
modified VI in terms of expected B, B between the optimal and true clusterings, 
expected VI, and VI between the optimal and true clusterings. 
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(a) B estimate: 9 clusters (b) B upper vertical bound: (c) B lower vertical bound: 

4 clusters 18 clusters 
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(d) B horizontal bounds: 8 and 11 clusters 

Figure 9: Example 1: 95% credible ball with Binder’s loss around c* (a) repre¬ 
sented by (b) the upper vertical bound, (c) the lower vertical bound, and (d) 
the horizontal bounds, where color denotes cluster membership. 


The base measure of the DP is conjugate product of normal inverse gamma 
priors with parameters (/Uo,i, c*, oq, bi) for i = 1,2, i.e. Pq has density 


p 0 (Mi,M2,o-i,cr|) oc n/| ex P (-^2 (ft -Mo,*) 2 ) (of)" 

i =1 V i \ i / 


exp 


The parameters were fixed to [io,i = 0, c* = 1/2, a,; = 2, bi = 1 for i = 1,2. The 
mass parameter a is given a Gam(l, 1) hyperprior. 

A marginal Gibbs sampler is used for inference ([Neal |2000|) with 10,000 iter¬ 
ations after a burn in period of 1,000 iterations. Trace plots and autocorrelation 
plots (not shown) suggest convergence. 

Among partitions sampled in the MCMC, only one is visited twice and all 
others are visited once in the first example, while no partitions are visited more 
than once in the second. Thus, an estimate of posterior mode based on frequency 
counts is not reliable. 

For the first example, Figure [7] depicts the optimal partition found by the 
greedy search algorithm for Binder’s loss (9 clusters), VI (4 clusters), and the 
modified VI (4 clusters) where the lower bound to the expected VI in Q is 
minimized. The four true clusters are visible in all solutions; however, Binder’s 
loss creates new small clusters for observations located on the border between 
clusters whose cluster membership is uncertain, overestimating the number of 
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(c) VI lower vertical bound: (d) VI horizontal bound: 

16 clusters 11 clusters 

Figure 10: Example 1: 95% credible ball with VI around c* (a) represented by 
(b) the upper vertical bound, (c) the lower vertical bound, and (d) the horizontal 
bound, where color denotes cluster membership. 


clusters. As expected, the B estimate and VI estimate achieve the lowest poste¬ 
rior expected loss for B and VI, respectively, but interestingly, the VI estimate 
has the smallest distance from the truth for both B and VI (see Table [lj. 

For the second example, the optimal partition found by the greedy search 
algorithm for Binder’s loss (12 clusters), VI (5 clusters), and the modified VI 
(4 clusters) are depicted in Figure [8] and compared in Table [2j Again, we 
observe that the four main clusters are present in all three point estimates, 
but Binder’s loss allocates uncertain observations on the borders to their own 
clusters, overestimating the number of clusters present. 

For the first example, Figures J9] and IT represent the 95% credible ball 
around the optimal partition for B and VI, respectively, through the upper 
vertical bound, the lower vertical bound, and the horizontal bounds, with data 
points colored according to cluster membership. Figures 11 and 12 provide an 
alternative visualization of the optimal partition and the bounds of the 95% 
credible ball. In these figures, the optimal partition and bounds are compared 
with the true partition through a color map of the N x N matrix with red 
indicating two data points in the same cluster for both the true and optimal 
partition; white indicating two data points in different clusters for both the true 
and optimal partition; green indicating two data points in the same cluster for 
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Observation Index 


(a) Posterior similarity matrix 


Observation Index 


(b) B estimate 



Observation Index Observation Index Observation Index 


(c) B upper vertical bound (d) B lower vertical bound (e) B horizontal bound 

Figure 11: Example 1: (a) Heat map of the posterior similarity matrix, (b) 
Optimal B partition compared with the true partition through a color map the 
N x N matrix with red indicating two data points in the same cluster for both 
the true and optimal partition; white indicating two data points in different 
clusters for both the truth and optimal; green indicating two data points in 
the same cluster for the truth and different clusters for the optimal; and blue 
indicating two data points in the same cluster for the optimal and different 
clusters for the truth. (c),(d),(e) Representation of the 95% credible ball with B 
through a color map of the N x N matrix comparing the bound with the truth 
(only one of two horizontal bounds shown for conciseness). 


the truth and different clusters for the optimal partition; and blue indicating two 
data points in the same cluster for the optimal partition and different clusters 
for the truth. Thus, red and white have a positive interpretation, while blue 
and green have a negative interpretation. Observations have been sorted by 
hierarchical clustering. Analogous plots for the second example are found in 
Figures [l3j [l4j [15] and [16] 

We observe that for the first example elements of the 95% credible ball 
with positive estimated posterior probability have at least four clusters for both 
metrics and at most 18 clusters for B or 16 clusters for VI, while the most 
distant elements contain 8 and 11 clusters for B and 11 clusters for VI. For 
both metrics, these bounds reallocate uncertain data points on the border or in 
some cases merge or split one of the four main clusters. In the second example, 
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Observation Index 


Observation Index 


(a) Posterior similarity matrix 


(b) VI partition estimate 



(c) VI upper vertical bound (d) VI lower vertical bound (e) VI horizontal bound 


Figure 12: Example 1: (a) Heat map of the posterior similarity matrix, (b) 
Optimal VI partition compared with the true partition through a color map the 
N x N matrix with red indicating two data points in the same cluster for both 
the true and optimal partition; white indicating two data points in different 
clusters for both the truth and optimal; green indicating two data points in 
the same cluster for the truth and different clusters for the optimal; and blue 
indicating two data points in the same cluster for the optimal and different 
clusters for the truth. (c),(d),(e) Representation of the 95% credible ball with 
VI through a color map of the TV x TV matrix comparing the bound with the 
truth. 


the green cluster in Figure[6b]is stable in all bounds, while the 95% credible ball 
reflects posterior uncertainty on whether to divide the remaining observations 
into 3 to 18 clusters for B and 2 to 15 clusters for VI. As a consequence of the 
asymmetric nature of B discussed in Section [ 3 J the number of clusters in the B 
upper or lower vertical bound is greater than or equal to the number of clusters 
in the VI upper or lower, respectively, vertical bound in all examples. 

Figures [TTJ [T2j [l5j and [16] provide a comparison of uncertainty represented by 
the posterior similarity matrix with the uncertainty represented by the 95% cred¬ 
ible ball. In general, the posterior similarity matrix appears to under-represent 
the uncertainty; indeed, one would conclude from the similarity matrix that 
there is only uncertainty in allocation of a few data points in Example 1. More¬ 
over, the 95% credible ball gives a precise quantification of the uncertainty. 
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(a) B estimate: 12 clusters (b) B upper vertical bounds: 4 clusters 
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(c) B lower vertical bound: (d) B horizontal bounds: 6 and 7 clusters 

19 clusters 


Figure 13: Example 2: 95% credible ball with Binder’s loss represented by (a) 
the upper vertical bounds, (b) the lower vertical bound, and (c) the horizontal 
bounds (only two of three horizontal bounds shown for conciseness), where color 
denotes cluster membership. 


Additionally, we note that for both metrics, there is greater uncertainty 
around the optimal estimate in Example 2; for Example 1, the 95% credible 
ball contains partitions with B-distance less than 0.097 or a Vl-distance of less 
than 0.841, while for Example 2, the 95% credible ball contains partitions that 
with a B-distance of less than 0.188 and a Vl-distance of less than 0.985. 

The greedy search algorithm was performed with different starting points 
and different values of l. which controls the amount of local exploration at each 
iteration. We experimented starting the search at the last MCMC sample or 
the MCMC sample which minimizes the criteria. The latter is clearly a better 
starting point, but requires additional computation for initialization. On the 
other hand, when initializing with the last sampled partition, more iterations 
are typically required to locate the optimal partition in the greedy search. In 
most cases, the algorithm converged to the same solution for both initializations, 
depending on the choice of l. The optimal partitions reported are found via the 
greedy search algorithm initialized at the MCMC sample which minimizes the 
criteria with l = 200. 

An advantage of the greedy search algorithm is that partitions not explored 
in the MCMC algorithm can be considered; for example, in all simulated and 
real examples, the B estimate is not among the sampled partitions and results 
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(c) VI lower vertical bound: (d) VI horizontal bound: 

16 clusters 10 clusters 


Figure 14: Example 2: 95% credible ball with VI represented by (a) the upper 
vertical bound, (b) the lower vertical bound, and (c) the horizontal bound, 
where color denotes cluster membership. 


in a lower expected loss than any sampled partition. Finally, we note that the 
optimal partition for VI and the modified VI are quite similar, differing in the 
allocation of one data point for the first example, a handful of data points in the 
second example, and no data points in the real example. However, computation 
time under the modified VI is significantly reduced by at least a 600 fold decrease 
in all examples. 


6.2 Galaxy example 

We consider an analysis of the galaxy data (Roeder 1990|), available in the 


MASS package of R, which contains measurements of velocities in km/sec of 82 
galaxies from a survey of the Corona Borealis region. The presence of clusters 
provides evidence for voids and superclusters in the far universe. 

The data are modelled with a DP mixture ([5]) . The parameters were selected 
empirically with = x,c = 1/2, a = 2, b = s, where x represents the sample 
mean and s 2 represents the sample variance. The mass parameter a is given a 
Gam(l, 1) hyperprior. 

With 10,000 samples after 1,000 burn in, the posterior mass is spread out 
over 9,636 partitions, emphasizing the need for appropriate summary tools. In 
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(a) Posterior similarity matrix 



50 100 150 200 


Observation Index 

(b) B estimate 



(c) B upper vertical bound (d) B lower vertical bound 


(e) B horizontal bound 


Figure 15: Example 2: (a) Heat map of the posterior similarity matrix, (b) 
Optimal B partition compared with the true partition through a color map the 
N x N matrix with red indicating two data points in the same cluster for both 
the true and optimal partition; white indicating two data points in different 
clusters for both the truth and optimal; green indicating two data points in 
the same cluster for the truth and different clusters for the optimal; and blue 
indicating two data points in the same cluster for the optimal and different 
clusters for the truth. (c),(d),(e) Representation of the 95% credible ball with B 
through a color map of the N x N matrix comparing the bound with the truth 
(only one of two upper bounds and one of three horizontal bounds shown for 
conciseness). 


Figure[l7j we plot the point estimate of the partition found by the greedy search 
algorithm for Binder’s loss and VI. The data values are plotted against the esti¬ 
mated density values from the DP mixture model and colored according cluster 
membership. Again, we observe that Binder’s loss prefers to place observations 
with uncertain allocation into singleton clusters, with the optimal partition con¬ 
taining 7 clusters, 4 of which are singletons, while the VI solution contains 3 
clusters. 

Table [3] compares the point estimates from the different criteria in terms of 
the posterior expected B and VI; as anticipated, the B solution has the smallest 
posterior expected B and the VI solution has the smallest posterior expected VI. 
Interestingly, the VI estimate and the optimal partition found by minimizing 
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Observation Index 


(a) Posterior similarity matrix 


Observation Index 


(b) VI estimate 



(c) VI upper vertical bound (d) VI lower vertical bound (e) VI horizontal bound 


Figure 16: Example 2: (a) Heat map of the posterior similarity matrix, (b) 
Optimal VI partition compared with the true partition through a color map the 
N x N matrix with red indicating two data points in the same cluster for both 
the true and optimal partition; white indicating two data points in different 
clusters for both the truth and optimal; green indicating two data points in 
the same cluster for the truth and different clusters for the optimal; and blue 
indicating two data points in the same cluster for the optimal and different 
clusters for the truth. (c),(d),(e) Representation of the 95% credible ball with 
VI through a color map of the TV x TV matrix comparing the bound with the 
truth. 


the lower bound to the posterior expected VI in Q are equivalent, while the 
latter requires significantly less computation time (a 2600 fold decrease). 

The 95% VI credible ball around the represenative VI partition contains all 
partitions with a VI distance less than 1.267. Figures 18 and 19 summarize 
the 95% credible ball through the upper vertical, lower vertical, and horizontal 
bounds. We observe a large amount of variability around the optimal partition. 
With 95% posterior probability, we believe that, on one extreme, the data could 
be modelled using only 2 components with a large variance for one component 
to account for outliers (red cluster in Figure (18a)). On the other extreme, the 
data could be further split in many, 15 to be precise, smaller clusters. Figure |19| 
emphasizes that the posterior similarity matrix under-represents the uncertainty 
around the point estimate. Following our comparison of B and VI in Section 
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EBL(c)= 0.2182 


Vl(c)= 0.9394 



(a) Binder’s loss: 7 clusters (b) VI: 3 clusters 


Figure 17: Galaxy example: optimal clustering estimate with color representing 
cluster membership for Binder’s loss and VI. The optimal solution for VI and 
the modified VI criteria are equivalent. 




(a) VI upper vertical bound: (b) VI lower vertical bound: (c) VI horizontal bounds: 

2 clusters 15 clusters 8 clusters 

Figure 18: Galaxy example: 95% credible ball with VI represented by (a) the 
upper vertical bound, (b) the lower vertical bound, and (c) the horizontal bound, 
where color denotes cluster membership. 



E[B|X>] 

E[VI|Z>] 

c* from B 

0.218 

1.014 

c* from VI 

0.237 

0.939 


Table 3: Galaxy example: a comparison of the optimal partition with Binder’s 
loss and VI in terms of posterior expected B and VI. The optimal solution for 
VI and the modified VI criteria are equivalent. 


[3] and the results of the simulated examples in Section 6.1, only the 95% VI 
credible ball is reported. 
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(a) Posterior similarity matrix (b) VI estimate 





(c) VI upper vertical bound (d) VI lower vertical bound 


(e) VI horizontal bound 


Figure 19: Galaxy example: (a) Heat map of the posterior similarity matrix, 
(b) Optimal VI partition (from the greedy search algorithm) depicted through 
a color map the binary N x N matrix with red indicating two data points in 
the same cluster for the optimal partition and white indicating two data points 
in different clusters. (c),(d),(e) Representation of the 95% credible ball with VI 
depicted through a color map of the binary N x N matrix. 


7 Discusssion 

Bayesian cluster analysis provides an advantage over classical cluster analysis, 
in that the Bayesian procedure returns a posterior distribution over the entire 
partition space, reflecting uncertainty in the clustering structure given the data, 
as opposed to returning a single solution. This allows one to assess statistical 
properties of the clustering given the data. However, due to the huge dimension 
of the partition space, an important problem in Bayesian cluster analysis is how 
to appropriately summarize the posterior. To address this problem, we have 
developed tools to obtain a point estimate of clustering based on the posterior 
and describe uncertainty around this estimate via the 95% credible ball. 

Obtaining a point estimate via a formal decision theory framework requires 
the specification of a loss function. Previous literature focused on Binder’s 
loss. In this work, we propose to use an information theoretic measure, the 
variation of information, and provide a detailed comparison of the two metrics, 
particularly focusing on their behavior on the lattice of partitions. We find 
that Binder’s loss exhibits peculiar asymmetries, placing a smaller loss on the 
partition which splits two equally sized clusters into many singletons compared 
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with a larger loss on the partition which merges these two clusters. The variation 
of information is more symmetric in this regard. This behavior of Binder’s loss 
causes the optimal partition to overestimate the number of clusters, allocating 
uncertain observations to their own cluster. In addition, we have also proposed a 
novel greedy search algorithm based on the Hasse diagram to locate the optimal 
partition, allowing one to explore beyond the space of partitions visited in the 
MCMC chain. 

To represent uncertainty around the point estimate, we construct 95% cred¬ 
ible balls around the point estimate and depict the credible ball through the 
upper vertical, lower vertical, and horizontal bounds. As opposed to the pos¬ 
terior similarity matrix, the 95% credible ball provides a precise quantification 
of the uncertainty present around the point estimate, and in examples, we find 
that an analysis based on the posterior similarity matrix leads one to be over 
certain in the clustering structure. 

The developed posterior summary tools for Bayesian cluster analysis will 
be shared through an R package ’mcclust.ext’, expanding upon the existing R 


package ’mcclust’ (Fritsch 2012 ), which contains tools for point estimation in 


Bayesian cluster analysis and cluster comparison. 

In future work, we aim to extend these ideas to Bayesian feature alloca¬ 
tion analysis, an extension of clustering which allows observations to belong to 


multiple clusters (see Griffiths and Ghahramani 2011 for an overview). 
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Appendix I: Proofs 


Proof of Property \3.I\ First note that if c > c, or equivalently cAc = c, then 
for all nonzero riij = j. Thus, if c > c, 
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( 6 ) 


Let rii denote the number of observations in cluster i under c; denote the 
number of observations in cluster i under c; and fL denote the number of ob¬ 
servations in cluster i under S. Then, from |el, if c > c > S, 
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For B the proof is similar, since if c > c, 
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Proof of Property 3.3 The meet between two clusterings c and c will have 
at most k N • kN clusters with ntj data points in cluster i,j for i = 1,..., fc/v, 
j = 1,..., fcjv (some riij may equal zero, resulting in less than kN • kN clusters). 
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The last line follows from the fact that t he pa irs (c, cAc) and (c, c A c) are 
vertically aligned, see the proof of Property 3.2 The proof for B is similar. || 


( 7 ) 


Proof of Property \3-4\ First note that from Property |3.2| 

d{ 1, 0) = d(l, c) + d( c, c A c) + d( 0, c A c), 

and 

d( 1, 0) = d(l, c) + d( c, c A c) + d(0, c A c). 

Combining 0 and Q, we have 

2d(l, 0) = d( 1, c) d(l, c) + d( c, c A c) + d( c, c A c) + 2d(0, c A c), 
which, using Property |3.3[ implies that 


( 8 ) 


d( 1,0) = - (d(l, c) + d(l, c) + d( c,c)) + d( 0, c A c). 
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Since d is a metric, the triangle inequality states that 

d{c , c) < d(l, c) + d(l, c). 

Adding d( c,c) to either side, we have 

d{ c, c) < i (d(l, c) + d(l, c) + d{ c, c)). 

Thus, 

d(c,c) <- (d(l, c) + d(l, c) + d( c, c)) + d(0, c A c) — d(0, c A c) 

= d( 1,0) - d(0, c A c) < d(l, 0). 

The last two statements results from VI(1, 0) = log (dV) andB(l,0) = 1—| 


Proof of 3.5 From Property 3.3 the distance between c and any c/c will be 
bounded below by the distance between c and their meet c A c, 

VI(c, c) > VI(c, c A c). 

If c Ac = c, then c > c, and there exists a c m such that c > c m y c. From 
Property |3.2| VI(c,c) > VI(c,c m ). Otherwise, c A c < c, and there exists a c s 
such that cAc < cA c. From Property |3.2| VI(c,cAc) > VI(c,c s ). Thus 
the closest partitions to c will be among those which cover c or those which c 
covers. 

Let’s first consider the partitions which cover c. If c m >~ c, then c m is 
obtained from c by merging two clusters i and j. and 


1 


VI(c,c m ) = — ((m + rij) log (m+rij) - n,log(n I ) - rij log (nj)) 


(9) 


which is minimized when i and j correspond to the two smallest clusters in c. 
Next consider the partitions which c covers. If c 3 -< c, then c s is obtained from 
c by splitting a cluster i of size rij > 1 into two clusters ii,i 2 of sizes n,;, and 
rii 2 , and 

VI(c,c s ) = (rij log (rij) - n il \og(m J - m 2 log (n» 2 )), (10) 

which is minimized when i is the smallest cluster of size rii > 1 and = 1 

and rii 2 = n* — 1. Thus, the closest partitions to c will be among those which 
merge the two smallest clusters or split the smallest cluster i of size rq > 1 into 
a singleton and a cluster of size rij — 1. 

To compare ([9]) and (10), we first consider the case when c contains at least 
two singletons. In this case, the closest partitions which cover c merge two 
singletons, and reduces to 

(21og(2) - llog(l) - llog(l)) = 

Letting i be the index of the smallest cluster of size rij > 1, the closest partition 
which c covers splits cluster i into a singleton and a cluster of size rij — 1, and 
(101 reduces to 

— (rij log(rij) - (rij - 1) log(rij - 1)). 
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For m = 2, these distances are equal, and the closest partitions to c will be 
those which merge any two singletons or split any cluster of size rij = 2. For 
rii > 2, the partitions which merge any two singletons are the closest. 

Next, suppose that c contains at most one singleton, and let n,, n 3 denote 
the sizes of the smallest two clusters, where rij corresponds to the size of the 
smaller cluster unless a singleton is present. The closest partitions which cover 
c merge any two clusters of sizes rij and n 7 -, and § is 

^ ((rij + rij) log(nj + rij) - n i log(n l ) - rij log (nj)) . (11) 


The closest partitions which c covers split any cluster of size rij into a singleton 
and a cluster of size rij — 1, (101 reduces to 


(rij log (rij) - (rij - 1) log(n l - 1)). 


( 12 ) 


In this case, (12) will be less than and the closest partitions are those 
which split the smallest cluster i into a singleton and cluster of size rij — 1. The 
proof for B is similar. § 


33 




